In today’s competitive retail industry, understanding what factors drive sales is essential for effective decision-making and strategic planning. This project aims to analyze historical sales data from one of the world’s largest retail chains to identify economic and seasonal variables that may influence business performance.
The dataset contains weekly sales records for multiple stores, along with additional information such as air temperature, fuel prices, holiday indicators, consumer price index (CPI), and unemployment rate. These variables allow us to explore patterns, assess external drivers of sales, and develop predictive models to forecast future performance under different economic scenarios.
By applying advanced data analysis techniques and machine learning models in R, this project seeks to provide actionable insights that can support inventory planning, marketing strategies, and resource allocation. The final results can help stakeholders anticipate changes in demand and make data-driven decisions that enhance operational efficiency and profitability.
The dataset used in this project contains historical weekly sales data from a major retail chain in the United States. Each observation corresponds to a specific store and week, accompanied by several external and economic variables that may influence sales behavior.
The dataset includes the following variables:
The dataset covers multiple years and stores, allowing for cross-sectional and time series analysis. It provides a rich context for exploring how economic and seasonal factors may influence retail performance. This makes it suitable for predictive modeling and scenario simulation in a business decision-making context.
# 📁 Importar dataset
data <- read.csv("Walmart_Sales.csv")
# 👀 Vista rápida del dataset
head(data)
str(data)
## 'data.frame': 6435 obs. of 8 variables:
## $ Store : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Date : chr "05-02-2010" "12-02-2010" "19-02-2010" "26-02-2010" ...
## $ Weekly_Sales: num 1643691 1641957 1611968 1409728 1554807 ...
## $ Holiday_Flag: int 0 1 0 0 0 0 0 0 0 0 ...
## $ Temperature : num 42.3 38.5 39.9 46.6 46.5 ...
## $ Fuel_Price : num 2.57 2.55 2.51 2.56 2.62 ...
## $ CPI : num 211 211 211 211 211 ...
## $ Unemployment: num 8.11 8.11 8.11 8.11 8.11 ...
summary(data)
## Store Date Weekly_Sales Holiday_Flag
## Min. : 1 Length:6435 Min. : 209986 Min. :0.00000
## 1st Qu.:12 Class :character 1st Qu.: 553350 1st Qu.:0.00000
## Median :23 Mode :character Median : 960746 Median :0.00000
## Mean :23 Mean :1046965 Mean :0.06993
## 3rd Qu.:34 3rd Qu.:1420159 3rd Qu.:0.00000
## Max. :45 Max. :3818686 Max. :1.00000
## Temperature Fuel_Price CPI Unemployment
## Min. : -2.06 Min. :2.472 Min. :126.1 Min. : 3.879
## 1st Qu.: 47.46 1st Qu.:2.933 1st Qu.:131.7 1st Qu.: 6.891
## Median : 62.67 Median :3.445 Median :182.6 Median : 7.874
## Mean : 60.66 Mean :3.359 Mean :171.6 Mean : 7.999
## 3rd Qu.: 74.94 3rd Qu.:3.735 3rd Qu.:212.7 3rd Qu.: 8.622
## Max. :100.14 Max. :4.468 Max. :227.2 Max. :14.313
# 🔄 Convertir la columna de fecha
data$Date <- dmy(data$Date) # Usa lubridate para formato DD-MM-YYYY
# ➕ Crear nuevas variables: Año, Mes y Semana
data <- data %>%
mutate(
Year = year(Date),
Month = month(Date, label = TRUE),
Week = week(Date)
)
data$Holiday_Flag <- as.factor(data$Holiday_Flag)
data$Store <- as.factor(data$Store)
# 💡 Verificar si hay valores NA
colSums(is.na(data))
## Store Date Weekly_Sales Holiday_Flag Temperature Fuel_Price
## 0 0 0 0 0 0
## CPI Unemployment Year Month Week
## 0 0 0 0 0
# 📈 Tendencia general de ventas en el tiempo
ggplot(data, aes(x = Date, y = Weekly_Sales)) +
geom_line(color = "steelblue") +
labs(title = "Weekly Sales Over Time", x = "Date", y = "Weekly Sales")
# 📊 Boxplot: ventas en semanas con vs sin feriados
ggplot(data, aes(x = as.factor(Holiday_Flag), y = Weekly_Sales)) +
geom_boxplot(fill = "darkorange") +
labs(title = "Sales Distribution - Holiday vs Non-Holiday",
x = "Holiday (0 = No, 1 = Yes)", y = "Weekly Sales")
# 🔍 Correlación entre variables numéricas
data_num <- data %>%
select(Weekly_Sales, Temperature, Fuel_Price, CPI, Unemployment)
ggpairs(data_num)
## Warning in geom_point(): All aesthetics have length 1, but the data has 25 rows.
## ℹ Please consider using `annotate()` or provide this layer with data containing
## a single row.
# Histograma con curva de densidad
ggplot(data, aes(x = Weekly_Sales)) +
geom_histogram(fill = "orange", bins = 30, color = "black") +
geom_density(color = "darkorange", size = 1.2) +
labs(title = "Distribution of Weekly Sales",
x = "Weekly Sales", y = "Frequency") +
theme_minimal()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
# Cálculo de promedio por tienda
avg_sales <- data %>%
group_by(Store) %>%
summarise(Average_Weekly_Sales = mean(Weekly_Sales))
# Gráfico de barras
ggplot(avg_sales, aes(x = factor(Store), y = Average_Weekly_Sales)) +
geom_bar(stat = "identity", fill = "steelblue") +
labs(title = "Average Weekly Sales per Store",
x = "Store", y = "Average Weekly Sales") +
theme_minimal()
# Agrupar por semana
sales_by_date <- data %>%
group_by(Date) %>%
summarise(Total_Weekly_Sales = sum(Weekly_Sales))
# Gráfico de línea
ggplot(sales_by_date, aes(x = Date, y = Total_Weekly_Sales)) +
geom_line(color = "orange", size = 1) +
labs(title = "Total Weekly Sales Over Time",
x = "Date", y = "Total Weekly Sales") +
theme_minimal()
# Boxplot de ventas según Holiday_Flag
ggplot(data, aes(x = factor(Holiday_Flag), y = Weekly_Sales)) +
geom_boxplot(fill = c("goldenrod", "orangered")) +
labs(title = "Weekly Sales: Holiday vs Non-Holiday Weeks",
x = "Holiday (0 = No, 1 = Yes)", y = "Weekly Sales") +
theme_minimal()
# Gráficos individuales
ggplot(data, aes(x = Temperature, y = Weekly_Sales)) +
geom_point(color = "darkorange") +
labs(title = "Sales vs Temperature") + theme_minimal()
ggplot(data, aes(x = Fuel_Price, y = Weekly_Sales)) +
geom_point(color = "darkorange") +
labs(title = "Sales vs Fuel Price") + theme_minimal()
ggplot(data, aes(x = CPI, y = Weekly_Sales)) +
geom_point(color = "darkorange") +
labs(title = "Sales vs CPI") + theme_minimal()
ggplot(data, aes(x = Unemployment, y = Weekly_Sales)) +
geom_point(color = "darkorange") +
labs(title = "Sales vs Unemployment") + theme_minimal()
# Seleccionar solo variables numéricas
numeric_vars <- data %>%
select(Weekly_Sales, Temperature, Fuel_Price, CPI, Unemployment)
# Calcular correlación
cor_matrix <- cor(numeric_vars)
# Heatmap con corrplot
corrplot(cor_matrix, method = "color", type = "upper",
addCoef.col = "black", tl.col = "black", number.cex = 0.8)
# Modelo de regresión lineal múltiple
modelo <- lm(Weekly_Sales ~ Temperature + Fuel_Price + CPI + Unemployment + Holiday_Flag, data = data)
# Resumen del modelo
summary(modelo)
##
## Call:
## lm(formula = Weekly_Sales ~ Temperature + Fuel_Price + CPI +
## Unemployment + Holiday_Flag, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1022429 -478555 -117266 397246 2800620
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1726523.4 79763.5 21.646 < 2e-16 ***
## Temperature -724.2 400.5 -1.808 0.07060 .
## Fuel_Price -10167.9 15762.8 -0.645 0.51891
## CPI -1598.9 195.1 -8.194 3.02e-16 ***
## Unemployment -41552.3 3972.7 -10.460 < 2e-16 ***
## Holiday_Flag1 74891.7 27639.3 2.710 0.00675 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 557400 on 6429 degrees of freedom
## Multiple R-squared: 0.02544, Adjusted R-squared: 0.02469
## F-statistic: 33.57 on 5 and 6429 DF, p-value: < 2.2e-16
# Instalar si no tienes el paquete
# install.packages("car")
vif(modelo)
## Temperature Fuel_Price CPI Unemployment Holiday_Flag
## 1.130023 1.084282 1.221478 1.150241 1.029239
# Graficar residuos y normalidad
par(mfrow = c(2, 2))
plot(modelo)
par(mfrow = c(1, 1))
# Predecir con el mismo dataset
data$Predicted_Sales <- predict(modelo, newdata = data)
# Comparar real vs predicho
ggplot(data, aes(x = Weekly_Sales, y = Predicted_Sales)) +
geom_point(color = "steelblue") +
geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "red") +
labs(title = "Actual vs Predicted Sales",
x = "Actual Sales", y = "Predicted Sales") +
theme_minimal()
# Dividir en train/test
set.seed(123)
index <- createDataPartition(data$Weekly_Sales, p = 0.8, list = FALSE)
train <- data[index, ]
test <- data[-index, ]
# Entrenar árbol
modelo_tree <- rpart(Weekly_Sales ~ Temperature + Fuel_Price + CPI + Unemployment + Holiday_Flag + Month + Store,
data = train, method = "anova")
# Graficar árbol
rpart.plot(modelo_tree)
# Predicción
pred_tree <- predict(modelo_tree, newdata = test)
# Evaluar
RMSE_tree <- RMSE(pred_tree, test$Weekly_Sales)
R2_tree <- R2(pred_tree, test$Weekly_Sales)
cat("Tree RMSE:", RMSE_tree, " | R²:", R2_tree, "\n")
## Tree RMSE: 175367.2 | R²: 0.9018235
modelo_rf <- randomForest(Weekly_Sales ~ Temperature + Fuel_Price + CPI + Unemployment + Holiday_Flag + Month + Store,
data = train, ntree = 500, importance = TRUE)
# Importancia de variables
varImpPlot(modelo_rf)
# Predicción
pred_rf <- predict(modelo_rf, newdata = test)
# Evaluación
RMSE_rf <- RMSE(pred_rf, test$Weekly_Sales)
R2_rf <- R2(pred_rf, test$Weekly_Sales)
cat("Random Forest RMSE:", RMSE_rf, " | R²:", R2_rf, "\n")
## Random Forest RMSE: 117950.1 | R²: 0.9569226
# Convertir data a matriz numérica
train_matrix <- model.matrix(Weekly_Sales ~ . -Date, data = train)[, -1]
test_matrix <- model.matrix(Weekly_Sales ~ . -Date, data = test)[, -1]
dtrain <- xgb.DMatrix(data = train_matrix, label = train$Weekly_Sales)
dtest <- xgb.DMatrix(data = test_matrix, label = test$Weekly_Sales)
# Entrenar modelo
params <- list(objective = "reg:squarederror", eval_metric = "rmse")
modelo_xgb <- xgboost(params = params, data = dtrain, nrounds = 100, verbose = 0)
# Predicción
pred_xgb <- predict(modelo_xgb, newdata = dtest)
# Evaluación
RMSE_xgb <- RMSE(pred_xgb, test$Weekly_Sales)
R2_xgb <- R2(pred_xgb, test$Weekly_Sales)
cat("XGBoost RMSE:", RMSE_xgb, " | R²:", R2_xgb, "\n")
## XGBoost RMSE: 101250.2 | R²: 0.9677056
# Crear un dataframe resumen de métricas
model_performance <- tibble(
Model = c("Linear Regression", "Decision Tree", "Random Forest", "XGBoost"),
RMSE = c(RMSE(predict(modelo, newdata = test), test$Weekly_Sales),
RMSE_tree,
RMSE_rf,
RMSE_xgb),
R2 = c(R2(predict(modelo, newdata = test), test$Weekly_Sales),
R2_tree,
R2_rf,
R2_xgb)
)
# Gráfico comparativo - RMSE
ggplot(model_performance, aes(x = Model, y = RMSE, fill = Model)) +
geom_bar(stat = "identity") +
labs(title = "Model Comparison - RMSE", y = "RMSE", x = "") +
theme_minimal() +
theme(legend.position = "none")
# Gráfico comparativo - R²
ggplot(model_performance, aes(x = Model, y = R2, fill = Model)) +
geom_bar(stat = "identity") +
labs(title = "Model Comparison - R²", y = "R²", x = "") +
theme_minimal() +
theme(legend.position = "none")
# Tabla con kable para portafolio RMarkdown
model_performance %>%
mutate(RMSE = round(RMSE, 2),
R2 = round(R2, 4)) %>%
kable(caption = "Model Performance Comparison", align = "c") %>%
kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed"))
| Model | RMSE | R2 |
|---|---|---|
| Linear Regression | 552282.9 | 0.0261 |
| Decision Tree | 175367.2 | 0.9018 |
| Random Forest | 117950.1 | 0.9569 |
| XGBoost | 101250.2 | 0.9677 |
# Tomar una muestra real del test set como base
whatif_data <- test[1:10, ]
# Simular un aumento del precio del combustible
whatif_data$Fuel_Price <- whatif_data$Fuel_Price + 0.50
# Simular una disminución de temperatura
whatif_data$Temperature <- whatif_data$Temperature - 5
# Predicción bajo escenario modificado
pred_whatif_rf <- predict(modelo_rf, newdata = whatif_data)
# Comparar con predicción original
pred_original_rf <- predict(modelo_rf, newdata = test[1:10, ])
# Crear tabla comparativa
whatif_comparison <- tibble(
Store = test$Store[1:10],
Temperature_Original = test$Temperature[1:10],
Fuel_Price_Original = test$Fuel_Price[1:10],
Predicted_Original = round(pred_original_rf, 2),
Temperature_Simulated = whatif_data$Temperature,
Fuel_Price_Simulated = whatif_data$Fuel_Price,
Predicted_Simulated = round(pred_whatif_rf, 2),
Difference = round(pred_whatif_rf - pred_original_rf, 2)
)
print(whatif_comparison)
## # A tibble: 10 × 8
## Store Temperature_Original Fuel_Price_Original Predicted_Original
## <fct> <dbl> <dbl> <dbl>
## 1 1 46.5 2.62 1475682.
## 2 1 74.8 2.85 1511195.
## 3 1 80.7 2.70 1465224.
## 4 1 80.4 2.67 1474595.
## 5 1 80.5 2.64 1447882.
## 6 1 87 2.69 1513607.
## 7 1 85.2 2.62 1475065.
## 8 1 51.4 2.77 1518765.
## 9 1 64.5 2.74 1644581.
## 10 1 48.3 2.98 1390005.
## # ℹ 4 more variables: Temperature_Simulated <dbl>, Fuel_Price_Simulated <dbl>,
## # Predicted_Simulated <dbl>, Difference <dbl>
ggplot(whatif_comparison, aes(x = Store)) +
geom_bar(stat = "identity", aes(y = Difference, fill = Difference > 0)) +
labs(title = "Impact of Fuel Price ↑ and Temperature ↓ on Predicted Sales",
y = "Change in Predicted Sales", x = "Store") +
theme_minimal() +
scale_fill_manual(values = c("red", "green"), labels = c("Decrease", "Increase"))